9 research outputs found

    Exploiting the Volatile Nature of Data and Information in Evolving Repositories and Systems with User Generated Content

    Modern technological advances have created an extremely large, highly heterogeneous and distributed collection of datasets that are highly volatile. This volatility makes their understanding, integration and management a challenging task. One of the first challenges is to create models that capture not only the changes to the values of the data but also the semantic evolution of the concepts that the data structures represent. Once this information has been captured, mechanisms should be put in place to exploit the evolution information in query formulation, reasoning, answering and representation. Additionally, the continuously evolving nature of the data makes it hard to determine the quality of the data observed at a specific moment, since it is highly uncertain whether this information will remain as is. Finally, an important task in this context, known as information filtering, is to match a piece of information that was recently updated (or added) in a repository to a user or query at hand. In this dissertation, we propose a novel framework to model and query data that have explicit evolution relationships among concepts. As a query language we present an expressive evolution graph traversal query language, which is tested on a number of real case scenarios: the history of biotechnology, the corporate history of US companies and others. To support query evaluation, we introduce an algorithm based on finding Steiner trees on graphs, which computes answers on the fly while taking into account the evolution connections among concepts. To address the problem of data quality in user-generated repositories (e.g. Wikipedia), we present a novel algorithm that detects individual controversies by using the substitutions in the revision history of a piece of content. The algorithm groups the disagreements between users by means of a context, i.e. the surrounding content, and by applying custom filters. In an extensive experimental evaluation we show that the proposed ideas lead to high effectiveness on various sources of controversies. Finally, we explore the problem of producing recommendations in evolving repositories by focusing on the cold-start problem, i.e. when little or no past information about the users and/or items is given. We present a number of novel algorithms that cope with the cold start by leveraging the item features using the k-neighbor classifier, the Naive Bayes classifier and the maximum entropy principle. The obtained results enable recommender systems to operate in rapidly updated domains such as news, university courses and social data.
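The idea of grouping substitutions from a revision history by their surrounding context can be illustrated with a small sketch. This is not the dissertation's algorithm: the function names, the fixed context window and the use of Python's `difflib` are all assumptions made for illustration, and the custom filters mentioned in the abstract are omitted.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def substitutions(old_tokens, new_tokens, window=2):
    """Yield (context, old_words, new_words) for each replaced span.

    The context is the `window` tokens to the left and right of the
    replaced span in the old revision (a hypothetical choice).
    """
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            left = tuple(old_tokens[max(0, i1 - window):i1])
            right = tuple(old_tokens[i2:i2 + window])
            yield (left, right), tuple(old_tokens[i1:i2]), tuple(new_tokens[j1:j2])


def group_by_context(revisions, window=2):
    """Group substitutions between consecutive revisions by shared context."""
    groups = defaultdict(list)
    for old, new in zip(revisions, revisions[1:]):
        for context, before, after in substitutions(old.split(), new.split(), window):
            groups[context].append((before, after))
    return dict(groups)


# Illustrative revision history of a single sentence (made-up data).
history = [
    "Chopin was a Polish composer",
    "Chopin was a French composer",
    "Chopin was a Polish composer",
]
controversy_candidates = group_by_context(history)
```

In this toy history both substitutions share the same context, so they fall into one group; a group with repeated back-and-forth substitutions is the kind of signal a controversy detector could then score or filter.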

    Pay-as-you-go Feedback in Data Quality Systems

    Supporting queries spanning across phases of evolving artifacts using Steiner forests

    The problem of managing evolving data has attracted considerable research attention. Researchers have focused on the modeling and querying of schema/instance-level structural changes, such as the addition, deletion and modification of attributes. Databases with such functionality are known as temporal databases. A limitation of temporal databases is that they treat changes as independent events, while often the appearance (or elimination) of some structure in the database is the result of an evolution of some existing structure. We claim that maintaining the causal relationship between the two structures is of major importance, since it allows additional reasoning to be performed and answers to be generated for queries that previously had no answers. We present here a novel framework for exploiting the evolution relationships between the structures in the database. In particular, our system combines different structures that are associated through evolution relationships into virtual structures to be used during query answering. The virtual structures define “possible” database instances, in a fashion similar to the possible worlds in probabilistic databases. The framework includes a query answering mechanism that allows queries to be answered over these possible databases without materializing them. Evaluation of such queries raises many interesting technical challenges, since it requires the discovery of Steiner forests on the evolution graphs. For this problem we have designed and implemented a new dynamic programming algorithm with exponential complexity in the size of the input query and polynomial complexity in terms of both the attribute and the evolution data sizes.
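The abstract does not spell out the paper's algorithm, so the sketch below uses a stand-in: the classic Dreyfus-Wagner dynamic program for a minimum Steiner tree (a single tree, not a forest). It shows the same complexity profile the abstract describes, exponential in the number of terminals (the query size) and polynomial in the graph size. The function name, the integer node encoding and the example graph are assumptions for illustration.

```python
from itertools import product

INF = float("inf")


def steiner_tree_cost(n, edges, terminals):
    """Minimum total weight of a tree connecting all `terminals`.

    n         -- number of nodes, labelled 0..n-1
    edges     -- iterable of (u, v, weight) undirected edges
    terminals -- non-empty list of nodes that must be connected
    """
    # All-pairs shortest paths (Floyd-Warshall); k is the outermost loop.
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k, i, j in product(range(n), repeat=3):
        if dist[i][k] + dist[k][j] < dist[i][j]:
            dist[i][j] = dist[i][k] + dist[k][j]

    # dp[mask][v]: cheapest tree spanning the terminals in `mask` plus node v.
    full = (1 << len(terminals)) - 1
    dp = [[INF] * n for _ in range(full + 1)]
    for i, term in enumerate(terminals):
        dp[1 << i] = dist[term][:]

    for mask in range(1, full + 1):
        if mask & (mask - 1) == 0:
            continue  # single-terminal masks are the base case
        # Step 1: join two subtrees that meet at the same node v.
        merged = [INF] * n
        sub = (mask - 1) & mask
        while sub:
            other = mask ^ sub
            for v in range(n):
                merged[v] = min(merged[v], dp[sub][v] + dp[other][v])
            sub = (sub - 1) & mask
        # Step 2: extend the joined tree from its meeting node u to v.
        for v in range(n):
            dp[mask][v] = min(merged[u] + dist[u][v] for u in range(n))

    return min(dp[full])


# Hypothetical evolution graph: node 3 is an intermediate concept that
# links the three queried concepts 0, 1 and 2 more cheaply than direct edges.
example_edges = [(0, 3, 1), (1, 3, 1), (2, 3, 1), (0, 1, 3), (1, 2, 3)]
example_cost = steiner_tree_cost(4, example_edges, [0, 1, 2])
```

In the example the cheapest connection routes through the non-terminal node 3, which is what distinguishes a Steiner tree from a spanning tree over the query nodes alone.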

    Modeling Concept Evolution: a Historical Perspective

    The world is changing, and so must the data that describes its history. Not surprisingly, considerable research effort has been spent in the database field in this direction, covering topics such as temporal models and schema evolution. A topic that has not received much attention, however, is that of concept evolution. For example, Germany (an instance-level concept) has evolved several times in the last century as it went through different governance structures, then split into two national entities that eventually joined again. Likewise, a caterpillar is transformed into a butterfly, while a mother becomes two (maternally related) entities. Similarly, the concept of Whale (a class-level concept) has changed over the past two centuries thanks to scientific discoveries that led to a better understanding of what the concept entails. In this work, we present a formal framework for modeling, querying and managing such evolution. In particular, we describe how to model the evolution of a concept, and how this modeling can be used to answer historical queries of the form “How has concept X evolved over period Y?”. Our proposal extends an RDF-like model with temporal features and evolution operators. We then provide a query language that exploits these extensions and supports historical queries.
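A historical query of the form “How has concept X evolved over period Y?” can be sketched over a list of timestamped evolution edges. This is only an illustration of the idea, not the paper's model or query language: the `Evolution` record, the operator names `split` and `merge`, the concept labels and the function `evolution_of` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evolution:
    """One evolution edge: `source` evolves into `target` in `year`."""
    source: str
    operator: str  # hypothetical operator names, e.g. "split" or "merge"
    target: str
    year: int


# Illustrative encoding of the Germany example from the abstract.
EVENTS = [
    Evolution("Germany", "split", "West Germany", 1949),
    Evolution("Germany", "split", "East Germany", 1949),
    Evolution("West Germany", "merge", "Reunified Germany", 1990),
    Evolution("East Germany", "merge", "Reunified Germany", 1990),
]


def evolution_of(concept, start, end, events=EVENTS):
    """Return the evolution events within [start, end] that are reachable
    from `concept` by following evolution edges forward in the graph."""
    frontier, seen, result = [concept], {concept}, []
    while frontier:
        current = frontier.pop()
        for event in events:
            if event.source == current and start <= event.year <= end:
                result.append(event)
                if event.target not in seen:
                    seen.add(event.target)
                    frontier.append(event.target)
    return sorted(result, key=lambda e: (e.year, e.source, e.target))


whole_century = evolution_of("Germany", 1900, 2000)
first_half = evolution_of("Germany", 1900, 1950)
```

Restricting the period to 1900-1950 returns only the two split events, while the full century also follows the split entities forward to their later merge.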

    Modeling and Mapping Multilingual and Historically Diverse Content

    Recent digitization efforts have made archival content more available and even searchable through the Web. History researchers use this content to study past events in relation to general historiographical issues, which may involve politics, society, ethics and others. However, to locate relevant content, researchers need to know the terminology that was used for the topic of interest in the past. This problem is crucial for history researchers, because it affects the time and quality of their work. Another problem of great importance when dealing with archival content is multilingualism. Simple translation is not enough to identify a relevant term in another language, because a term may undergo different changes in different social or cultural contexts. To address these challenges, the EU-funded project Papyrus aims to develop tool support for cross-disciplinary information retrieval of news content for historical research. To model both disciplines, i.e. history and news, we developed two ontologies. The News ontology reflects the perspective of news professionals on digital archives using the NewsML-G2 standard, whereas the History ontology models the history perspective on the events and topics covered by the news. The History ontology is based on the CIDOC Conceptual Reference Model, which embraces several standards for modeling information in the cultural heritage domain. To provide a means of communication between these two disciplines, we use mappings that establish correspondences between the News and History ontologies. This work discusses the major challenges in modeling and mapping terms and concepts that describe archival content that is multilingual and historically diverse.